(54) Method and system for tracking human speakers



(57) A method and system for tracking human speakers using a
plurality of acoustic sensors arranged in an array to detect the
voice of the speakers within an angular range in order to determine
a most favorable direction for detecting the voice in a detection
period. A beamformer is used to form a plurality of beams, each
covering a different direction within the angular range, and to
generate a signal responsive to the voice of the speakers for each
beam. A comparator is used to periodically compare the power level
of the signal of different beams in order to determine the most
favorable detection direction according to the movement of the
human speakers. A voice activity detection device is used to
indicate to the comparator when the voice of the speakers is
detected so that the comparator determines the most favorable
detection direction based on the voice of the speakers and not the
noise when the speakers are silent during the detection period.
Description

Field of the Invention 

[0001] The present invention relates generally to a microphone
array that tracks the direction of the voice of human speakers
and, more specifically, to a hands-free mobile phone.

Background of the Invention 

[0002] Mobile phones are commonly used in a car to 
provide the car driver a convenient telecommunication 
means. The user can use the phone while in the car with- 
out stopping the car or pulling the car over to a parking 
area. However, using a mobile phone while driving rais- 
es a safety issue because the driver must constantly ad- 
just the position of the phone with one hand. This may 
distract the driver from paying attention to the driving. 
[0003] A hands-free car phone system that uses a single microphone
and a loudspeaker located at a distance from the driver can be a
solution to the above-described safety problem. However, the speech
quality of such a hands-free phone system is far inferior to the
quality usually attainable from a phone with a handset supported by
the user's hand. The major disadvantages of using the
above-described hands-free phone system arise from the fact that
there is a considerable distance between the microphone and the
user's mouth and that the noise level in a moving car is usually
high. The increase in the distance between the microphone and the
user's mouth drastically reduces the speech-to-ambient noise ratio.
Moreover, the speech is severely reverberated and thus less natural
and intelligible.

[0004] A hands-free system with several micro- 
phones, or a multi-microphone system, is able to im- 
prove the speech-to-ambient noise ratio and make the 
speech signal sound more natural without the need of 
bringing the microphone closer to the user's mouth. This 
approach does not compromise the comfort and con- 
venience of the user. 

[0005] Speech enhancement in a multi-microphone 
system can be achieved by an analog or digital beam- 
forming technique. The digital beamforming technique 
involves a beamformer that uses a plurality of digital fil- 
ters to filter the electro-acoustic signals received from a 
plurality of microphones and the filtered signals are 
summed. The beamformer amplifies the microphone 
signals responsive to sound arriving from a certain di- 
rection and attenuates the signals arriving from other di- 
rections. In effect, the beamformer directs a beam of in- 
creased sensitivity towards the source in a selected di- 
rection in order to improve the signal-to-noise ratio of 
the microphone system. Ideally, the output signal of a 
multi-microphone system should sound similar to a
microphone that is placed next to the user's mouth.
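
The beamforming idea described above can be illustrated with its simplest special case, a delay-and-sum beamformer, in which each per-microphone filter is reduced to a pure delay. The sketch below is not the filter design of this description; the function names, the speed of sound, the whole-sample delays and the plane-wave, uniform-linear-array geometry are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature (assumed)

def arrival_delays(num_mics, spacing_m, angle_deg, fs_hz):
    """Per-microphone arrival delay, in whole samples, of a plane wave
    reaching a uniform linear array from angle_deg (0 = broadside)."""
    tau = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    raw = [round(m * tau * fs_hz) for m in range(num_mics)]
    earliest = min(raw)
    return [d - earliest for d in raw]  # shift so no delay is negative

def delay_and_sum(channels, delays):
    """Form one beam: delay every channel so that a wave from the steered
    direction lines up across channels, then average the channels."""
    latest = max(delays)
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        comp = latest - d  # compensating delay for this channel
        for t in range(comp, n):
            out[t] += signal[t - comp]
    return [v / len(channels) for v in out]
```

Sound from the steered direction adds coherently, while sound from other directions is averaged with mismatched delays and is therefore attenuated, which is the directional sensitivity the paragraph describes.
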
[0006] Beamforming techniques are well-known. For example, the
article entitled "Voice source localization for automatic camera
pointing system in videoconferencing", by H. Wang and P. Chu
(Proceedings of IEEE 1997 Workshop on Applications of Signal
Processing to Audio and Acoustics, 1997) discloses an algorithm for
voice source localization. The major drawback of this voice source
localization algorithm is that it is only applicable to a
microphone system wherein the space between microphones is
sufficiently large, 23cm (9") used in one direction and 30cm
(11.7") used in the other direction. Moreover, the performance of
the disclosed microphone system is not reliable in an environment
where the ambient noise levels are high and reverberation is
severe.

[0007] The article entitled "A signal subspace tracking algorithm
for microphone array processing of speech", by S. Affes and
Y. Grenier (IEEE Transactions on Speech and Audio Processing,
Vol.5, No.5, pp.425-437, September 1997) describes a method of
adaptive microphone array beamforming using matched filters with
subspace tracking.

[0008] The performance of the system as described by S. Affes and
Y. Grenier is also not reliable when the ambient noise levels and
reverberation are high. Furthermore, this system only allows the
user to move slightly, in a circle of about 10cm (2.54") radius.
Thus, the above-described systems cannot reliably perform in an
environment of a moving car where the ambient noise levels are
usually high and there can be more than one human speaker who has
a reasonable space to move around.

[0009] U.S. Patent No. 4,741,038 (Elko et al) discloses a sound
location arrangement wherein a plurality of electro-acoustical
transducers are used to form a plurality of receiving beams to
intercept sound from one or more specified directions. In the
disclosed arrangement, at least one of the beams is steerable. The
steerable beam can be used to scan a plurality of predetermined
locations in order to compare the sound from those locations to
the sound from a currently selected location.
[0010] The article entitled "A self-steering digital microphone
array", by W. Kellermann (Proceedings of ICASSP-91, pp. 3581-3584,
1991) discloses a method of selecting the beam direction by voting
using a novel voting algorithm.

[0011] The article entitled "Autodirective Microphone System" by
J. L. Flanagan et al (Acustica, Vol. 73, pp. 58-71, 1991) discloses
a two-directional beamforming system for an auditorium wherein the
microphone system is dynamically steered or pointed to a desired
talker location.

[0012] However, the above-described systems are either too
complicated or they are not designed to perform in an environment
such as the interior of a moving car where the ambient noise
levels are high and the human speakers in the car are allowed to
move within a broader range. Furthermore, the above-described
systems do not distinguish the voice from the near-end human
speakers from the voice of the far-end human speakers through the
loudspeaker of a hands-free phone system.

Summary of the Invention 

[0013] The first aspect of the present invention is to
provide a system that uses a plurality of acoustic sen- 
sors for tracking at least one human speaker in order to 
effectively detect the voice from the human speaker, 
wherein the human speaker and the acoustic sensors 
are separated by a speaker distance along a speaker 
direction, and wherein the human speaker is allowed to 
move relative to the acoustic sensors resulting in a 
change in the speaker direction within an angular range, 
and wherein each acoustic sensor produces an electri- 
cal signal responsive to the voice of the human speaker. 
The system comprises: a) a beamformer operatively 
connected to the acoustic sensors to receive the elec- 
trical signal, wherein the beamformer is capable of form- 
ing N different beams, and each of the beams defines a 
favorable direction to detect the voice from the human 
speaker by the acoustic sensors and each different 
beam is directed in a substantially different direction 
within the angular range, and wherein the beamformer 
further outputs for each beam a beam power responsive 
to the voice detected by the acoustic sensors; and b) a 
comparator operatively connected to the beamformer
for comparing the beam power of each beam in order to 
determine a most favorable direction to detect the voice 
of the human speaker, wherein the comparator com- 
pares the beam power of each beam periodically so as 
to determine the most favorable detection direction ac- 
cording to the change in the speaker direction. 
[0014] The second aspect of the present invention is 
to provide a method of tracking at least one human 
speaker using a plurality of acoustic sensors in order to 
effectively detect the voice from the human speaker, 
wherein the human speaker and the acoustic sensors 
are separated by a speaker distance along a speaker 
direction, and wherein the human speaker is allowed to 
move relative to the acoustic sensors resulting in a 
change in the speaker direction within an angular range, 
and wherein each acoustic sensor produces an electri- 
cal signal responsive to the voice of the speaker. The 
method includes the steps of: a) forming N different 
beams from the electrical signal such that each beam 
defines a favorable direction to detect the voice of the 
human speaker by the acoustic sensors and each dif- 
ferent beam is directed in a substantially different direc- 
tion within the angular range, wherein each beam has 
a beam power responsive to the electrical signal; and 
b) periodically comparing the beam power of each beam 
in order to determine the most favorable direction to
detect the voice of the human speaker according to the
change of the speaker direction. 
[0015] The present invention will become apparent 
upon reading the description taken in conjunction with 
Figure 1 to Figure 4. 



Brief Description of the Drawings 

[0016] Figure 1 is a schematic representation of a 
speaker tracking system showing a plurality of micro- 
phones arranged in an array to detect the sounds from 
two human speakers located at different distances and 
in different directions. 

[0017] Figure 2 is a block diagram showing the speaker
tracking system being connected to a transceiver to
be used as a telecommunication device.
[0018] Figure 3 is a block diagram showing the com- 
ponents of the speaker tracking system, according to 
the present invention. 

[0019] Figure 4 is a block diagram showing the detail
of the speaker tracking processor of the present inven- 
tion. 

Detailed Description 

[0020] Figure 1 shows a speaker tracking system 10 
including a plurality of microphones or acoustic sensors 
20 arranged in an array for detecting and processing the 
sounds of human speakers A and B who are located in 
front of the speaker tracking system 10. This arrange- 
ment can be viewed as a hands-free telephone system 
for use in a car where speaker A represents the driver 
and speaker B represents a passenger in the back seat. 
In practice, the speakers are allowed to move within a 
certain range. The movement range of speaker A is rep- 
resented by a loop denoted by reference numeral 102. 
The approximate distance of speaker A from the acoustic
sensors 20 is denoted by DA along a general direction
denoted by an angle αA. Similarly, the movement range of
speaker B is denoted by reference numeral 104, and the
approximate distance of speaker B from the acoustic sensors
20 is denoted by DB along a general direction denoted by an
angle αB.
[0021] The speaker tracking system 10 is used to select a
favorable detection direction to track the speaker who speaks. In
considering the movement ranges and the locations of the speakers,
the speaker tracking system 10 is designed to form N different
beams to cover N different directions within an angular range, α.
Each beam approximately covers an angular range of α/N. The signal
responsive to the voice of the speaker as detected in the
favorable detection direction is denoted as output signal 62
(Tx-Signal). The output signal 62, which can be in a digital or
analog form, is conveyed to the transceiver (Figure 2) so that
speakers A and B can communicate with a far-end human speaker at a
remote location.

[0022] Preferably, the acoustic sensors 20 are ar- 
ranged in a single array substantially along a horizontal 
line. However, the acoustic sensors 20 can be arranged 
along a different direction, or in a 2D or 3D array. The 
number of the acoustic sensors 20 is preferably between 4 and 6
when the speaker tracking system 10 is used in a relatively small
space such as the interior of a car. However, the number of the
acoustic sensors 20 can be smaller or greater, depending on the
available installation space and the desired beam resolution.
[0023] Preferably, the spacing between two adjacent acoustic
sensors 20 is about 9cm (3.54"). However, the spacing between any
two adjacent acoustic sensors can be smaller or larger depending
on the number of acoustic sensors in the array.

[0024] Preferably, the angular range α is about 120
degrees, but it can be smaller or larger.
[0025] Preferably, the number N of the beams is 6 to 8, but it can
be smaller or larger, depending on the angular coverage of the
beams and the desired beam resolution. It should be noted,
however, that the coverage angle α/N for each beam is estimated
only from the main lobe of the beam. In practice, the coverage
angle of the side-lobes is much larger than α/N.
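
As a worked example of how N beams can divide the angular range, the following sketch computes nominal beam centre directions when each beam covers an equal share of the range. The even spacing, the centring of the range on the array broadside and the function name `beam_centres` are assumptions for illustration; the description only states the approximate coverage per beam.

```python
def beam_centres(angular_range_deg, n_beams):
    """Centre direction of each beam, in degrees, when n_beams equal-width
    beams tile an angular range centred on 0 degrees (broadside)."""
    width = angular_range_deg / n_beams      # main-lobe coverage per beam
    start = -angular_range_deg / 2.0         # left edge of the range
    return [start + (k + 0.5) * width for k in range(n_beams)]

# A range of about 120 degrees split into 6 beams gives 20 degrees per
# beam, with centres at -50, -30, -10, 10, 30 and 50 degrees.
print(beam_centres(120.0, 6))
```

With 8 beams over the same range, each beam would nominally cover 15 degrees.
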
[0026] Figure 2 illustrates a telecommunication device 100, such
as a hands-free phone, using the speaker tracking system of the
present invention. The telecommunication device 100 allows one or
more near-end human speakers (A, B in Figure 1) to communicate
with a far-end human speaker (not shown) at a remote location. As
shown, the output signal 62 (Tx-Signal) is conveyed to a
transceiver 14 for transmitting to the far-end speaker.
Optionally, a gain control device 12 is used to adjust the power
level of the output signal 62.
[0027] The far-end signal 36 (Rx-Signal) received by 
the transceiver 14 is processed and amplified by an am- 
plifier 16. A loudspeaker 18 is then used to produce 
sound waves responsive to the processed signal so as 
to allow the near-end speakers to hear the voice of the 
far-end speaker in a hands-free fashion. 
[0028] Figure 3 shows the structure of the speaker tracking system
10. As shown, each of the acoustic sensors 20 is operatively
connected to an A/D converter 30 so that the electro-acoustic
signals from the acoustic sensors 20 are conveyed to a beamformer
40 in a digital form. The beamformer 40 is used to form N
different beams covering N different directions in order to cover
the whole area of interest. The N signal outputs 50 from the
beamformer 40 are conveyed to a speaker tracking processor 70
which compares the N signal outputs 50 to determine the highest
signal-to-noise levels among the N signal outputs 50. Accordingly,
the speaker tracking processor 70 selects the most favorable
direction for receiving the voice of a speaker for a certain time
window and sends a signal 90 to a beam power selecting device,
such as an overlap-add device 60. The signal 90 includes
information indicating a direction of arrival (DOA) which is the
most favorable detection direction. The overlap-add device 60,
which also receives the N output signals 50 from the beamformer
40, selects the output signal according to the DOA sent by the
speaker tracking processor 70. The beamformer 40 updates the N
output signals 50 with a certain sampling frequency F and sends
the N updated output signals 50 to the speaker tracking processor
70 in order to update the DOA. If the favorable detection
direction of the voice does not change, the DOA remains the same
and there is no need for the overlap-add device 60 to change to a
new output signal. However, if the speaker moves or a different
speaker speaks, the speaker tracking processor 70 sends out a new
DOA signal 90 to the overlap-add device 60. Accordingly, the
overlap-add device selects a new output signal among the N updated
output signals 50. In order to avoid an abrupt change in the sound
level as conveyed by the Tx-Signal 62, an overlap-add procedure is
used to join the successive beamformer output segments
corresponding to the old beam direction and the new beam
direction.

[0029] In a high ambient noise environment, it is pre- 
ferred that a voice activity detection device is used to 
differentiate noise from the voice of the human speak- 
ers. With a preset detection level, a voice activity detec- 
tion device can indicate when a voice is present and 
when there is only noise. As shown, a voice-activity de- 
tector (Tx-VAD) 32 is used to send a signal 33 to the 
speaker tracking processor 70 to indicate when any of 
the human speakers speaks and when it is a noise-only 
period. Similarly, a voice-activity detector (Rx-VAD) 34 
is used to send a signal 35 to the speaker tracking proc- 
essor 70 to indicate when the voice of the far-end human 
speaker is detected and when there is no voice but noise 
from the far-end being detected. It should be noted that 
the voice-activity detector 34 can be implemented near 
the speaker tracking processor 70 as part of the speaker 
tracking system 10, or it can be implemented at the 
transmitter end of the far-end human speakers for send- 
ing information related to the signal 35 to the speaker 
tracking processor 70. 
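
A voice activity decision against a preset detection level can be sketched, in its most basic form, as a comparison of frame power with a threshold. This is only an illustration: the description does not specify how the detectors 32 and 34 work internally, and the frame-power criterion, the function names and the threshold value are assumptions.

```python
def frame_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def voice_active(frame, detection_level):
    """Return 1 when the frame power exceeds the preset detection level
    (voice assumed present), otherwise 0 (noise-only period)."""
    return 1 if frame_power(frame) > detection_level else 0

# Hypothetical frames: a loud one and a quiet one, with an assumed level.
print(voice_active([0.5, -0.5, 0.5, -0.5], 0.1))      # loud frame
print(voice_active([0.01, -0.01, 0.01, -0.01], 0.1))  # quiet frame
```
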

[0030] Moreover, a signal 36 responsive to the sound 
from the far-end human speaker is conveyed to the 
speaker tracking processor 70 for babble noise estima- 
tion, as described in conjunction with Figure 4 below. 
[0031] Preferably, the update frequency F for direction estimation
is between 8Hz and 50Hz, but it can be higher or lower. With an
update frequency of 8Hz to 50Hz, the DOA as indicated by signal 90
generally represents a favorable detection direction within a time
window defined by 1/F, which ranges from 20ms to 125ms.
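
The relation between the update frequency F and the time window 1/F can be checked with a few lines of arithmetic. The helper names and the 8 kHz sampling rate used for the sample count are assumptions for the example.

```python
def window_ms(update_hz):
    """Length of the direction-estimation time window 1/F in milliseconds."""
    return 1000.0 / update_hz

def samples_per_window(update_hz, fs_hz):
    """Number of audio samples in one window at sampling rate fs_hz."""
    return int(round(fs_hz / update_hz))

print(window_ms(8.0), window_ms(50.0))   # slowest and fastest preferred rates
print(samples_per_window(50.0, 8000.0))  # assumed 8 kHz audio sampling
```
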
[0032] Optionally, the speaker tracking system 10 is 
allowed to operate in at least two different modes: 

1) in the ON mode, the speaker tracking system 10 
periodically compares the beam power of each 
beam in order to determine the most favorable de- 
tection direction in a continuous fashion, and 

2) in the FREEZE mode, the speaker tracking system 10 is
allowed to compare the beam power of each beam in order to
determine the most favorable detection direction within a
short period of steering time (a few seconds, for example)
and the most favorable detection direction so determined is
kept unchanged after the steering time has expired.
[0033] In order to select the operating modes, a
MODE control device 38 is used to send a MODE signal 
39 to the speaker tracking processor 70. 
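
The two operating modes can be pictured as a gate that decides whether the direction estimate may still be updated. The class name, the method name and the 3-second steering time below are hypothetical and serve only to illustrate the ON and FREEZE behavior described above.

```python
class DirectionUpdateGate:
    """Decides whether the DOA may be updated, given the operating mode."""

    def __init__(self, mode="ON", steering_time_s=3.0):
        if mode not in ("ON", "FREEZE"):
            raise ValueError("mode must be 'ON' or 'FREEZE'")
        self.mode = mode
        self.steering_time_s = steering_time_s  # assumed "a few seconds"

    def may_update(self, elapsed_s):
        """ON: always track. FREEZE: track only during the steering time,
        then keep the last direction unchanged."""
        if self.mode == "ON":
            return True
        return elapsed_s < self.steering_time_s

gate = DirectionUpdateGate(mode="FREEZE", steering_time_s=3.0)
print(gate.may_update(1.0), gate.may_update(10.0))
```
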
[0034] Preferably, the overlap-add procedure for directional
smoothing is carried out as follows. The overlapping frames of the
output of the beamformer 40 are windowed by a trapezoidal window
and the overlapping windowed frames are added together. The slope
of the window must be long enough to smooth the effect arising
from the changing of beam directions. Accordingly, the Tx-Signal
62 represents the beam power of a selected beam within a sampling
window as smoothed by the overlap-add device 60. A typical overlap
time is about 50ms.
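
The overlap-add procedure with a trapezoidal window can be sketched as follows. The frame layout, the half-sample ramp offsets (chosen so that overlapping slopes sum exactly to one) and the function names are assumptions; at an assumed 8 kHz sampling rate, an overlap of about 50ms would correspond to a ramp of 400 samples.

```python
def trapezoid(frame_len, ramp):
    """Trapezoidal window: linear rise over `ramp` samples, flat top,
    linear fall over `ramp` samples. Requires frame_len >= 2 * ramp."""
    w = []
    for n in range(frame_len):
        if n < ramp:
            w.append((n + 0.5) / ramp)
        elif n >= frame_len - ramp:
            w.append((frame_len - n - 0.5) / ramp)
        else:
            w.append(1.0)
    return w

def overlap_add(frames, ramp):
    """Window equal-length frames and add them with `ramp` samples of
    overlap, so the falling slope of one frame meets the rising slope
    of the next."""
    frame_len = len(frames[0])
    hop = frame_len - ramp
    w = trapezoid(frame_len, ramp)
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        for n, s in enumerate(frame):
            out[i * hop + n] += w[n] * s
    return out

# Old beam output at level 1.0, new beam output at level 3.0: the joined
# signal moves gradually from 1.0 to 3.0 instead of jumping.
joined = overlap_add([[1.0] * 12, [3.0] * 12], ramp=4)
print(joined[4:16])
```
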

[0035] Figure 4 illustrates the direction estimation process
carried out by the speaker tracking processor 70 of the present
invention. As shown, each of the N output signals 50 is filtered
by a band-pass filter (BPF) 72 in order to optimize the
speech-to-ambient noise ratio and the beamformer directivity.
Preferably, the band-pass filter frequency is about 1kHz to 2kHz.
The band-pass filtered signals 73 are simultaneously conveyed to a
noise-level estimation device 82 and a beam-power estimation
device 74. A Tx-VAD signal 33 provided by the voice-activity
detector 32 (Figure 3) is conveyed to both the noise-level
estimation device 82 and a comparison and decision device 80.
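
One simple way to realize a 1kHz to 2kHz band-pass stage is a single second-order (biquad) section. The sketch below uses the widely published Audio EQ Cookbook band-pass form with unity gain at the centre frequency; this filter type, its order, the 8 kHz sampling rate in the example and all names are assumptions, since the description does not specify the design of the band-pass filter 72.

```python
import math

def bandpass_biquad(f_lo, f_hi, fs):
    """Coefficients (b, a) of a second-order band-pass section centred
    between f_lo and f_hi, with unity gain at the centre frequency."""
    f0 = math.sqrt(f_lo * f_hi)       # geometric centre frequency
    q = f0 / (f_hi - f_lo)            # quality factor from the bandwidth
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = (alpha, 0.0, -alpha)
    a = (1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha)
    return b, a

def apply_biquad(b, a, signal):
    """Filter a sequence of samples with the difference equation
    a0*y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in signal:
        y0 = (b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2) / a[0]
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

b, a = bandpass_biquad(1000.0, 2000.0, 8000.0)
filtered = apply_biquad(b, a, [1.0, 0.0, 0.0, 0.0])  # first impulse-response samples
```

A steeper filter (higher order, or a cascade of such sections) would reject low-frequency car noise more strongly; the single section is kept here for brevity.
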

[0036] During noise-only periods as indicated by the Tx-VAD signal
33, the noise-level estimation device 82 estimates the noise power
level of each beam. Preferably, the noise power level is estimated
in a first order Infinite Impulse Response (IIR) process as
described by the following equation:

$$n_{k,i+1} = \sigma_1\, n_{k,i} + (1 - \sigma_1)\,\frac{1}{\mathit{FLEN}} \sum_{\mathit{FLEN}} x_{k,i}^{2}$$

where $n_{k,i}$ is the noise power level of the $k$th beam at the
sampling window $i$; $\sigma_1$ is a constant which is smaller
than 1; and $x_{k,i}$ is the $i$th frame of the $k$th band-pass
filtered beamformer output among the N filtered signal outputs 73
during a noise-only period. FLEN is the frame length used for
noise estimation. The estimated noise power level 83 for each of
the N beams is conveyed to the beam-power estimation device 74.
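
A first-order IIR (exponentially weighted) estimate of the noise power of one beam, updated only during noise-only periods, might look like the sketch below. The parameter name `sigma`, its value and the choice to hold the estimate unchanged while speech is present are assumptions for illustration.

```python
def frame_power(frame):
    """Mean squared amplitude over one frame (the frame length is FLEN)."""
    return sum(s * s for s in frame) / len(frame)

def update_noise_level(noise_prev, frame, tx_vad, sigma=0.9):
    """One step of a first-order IIR noise power estimate for one beam.
    tx_vad == 1 means near-end speech is present, so the estimate is held;
    tx_vad == 0 means a noise-only period, so the estimate is updated."""
    if tx_vad:
        return noise_prev
    return sigma * noise_prev + (1.0 - sigma) * frame_power(frame)

# Noise-only frame: the estimate moves part of the way toward the new power.
print(update_noise_level(1.0, [2.0, 2.0], tx_vad=0, sigma=0.5))
```

A constant close to 1 makes the estimate change slowly, which suits a noise floor that varies more slowly than speech.
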
[0037] Based on the band-pass filtered beamformer outputs 73 and
the estimated noise power levels 83 for the corresponding beam
directions, the beam-power estimation device 74 estimates the
power level of each band-pass filtered output 73 in a process
described by the following equation:

$$p_{k,i+1} = \sigma_2\, p_{k,i} + (1 - \sigma_2)\,\frac{\sum_{\mathit{FLEN}} x_{k,i}^{2}}{\mathit{FLEN} \cdot n_{k,i}}$$

where $p_{k,i}$ is the power level of the $k$th beam at the
sampling window $i$ and $\sigma_2$ is a constant which is smaller
than 1. Each power level is normalized with the noise level of the
corresponding beam. This normalization process helps bring out
only the power differences in the voice in different beam
directions. It also helps reduce the differences in the power
level due to the possible differences in the beam-width.
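
The effect of normalizing each beam's power by its own noise level can be seen in a small sketch. The function name, the parameter name `sigma2` and the numeric values are assumptions for illustration only.

```python
def update_beam_power(p_prev, frame, noise_level, sigma2=0.8):
    """One step of a smoothed, noise-normalized power estimate for one beam:
    the frame power is divided by that beam's estimated noise level before
    being averaged into the running estimate."""
    frame_power = sum(s * s for s in frame) / len(frame)
    return sigma2 * p_prev + (1.0 - sigma2) * frame_power / noise_level

# Two hypothetical beams with different absolute levels but the same
# speech-to-noise ratio end up with the same normalized power.
quiet_beam = update_beam_power(0.0, [2.0, 2.0], noise_level=2.0, sigma2=0.5)
loud_beam = update_beam_power(0.0, [4.0, 4.0], noise_level=8.0, sigma2=0.5)
print(quiet_beam, loud_beam)
```
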
[0038] The estimated power levels $p_{k,i+1}$ ($k = 1, \dots, N$) are
then sent out as signals 75 to a power level adjuster 78 
for selecting the next DOA. 

[0039] Because the distance from the acoustic sen- 
sors 20 to speaker A is shorter than the distance from 
the acoustic sensors 20 to speaker B, the sound level
from speaker B as detected by the acoustic sensors is 
generally lower. It is possible that a different weighting 
factor is given to the signals responsive to the sound of 
different speakers in order to increase the level of a 
weaker sound. For example, a weighting factor of 1 can 
be given to the beams that are directed towards the gen- 
eral direction of speaker B, while a weighting factor of 
0.8 can be given to the beams that are directed towards 
the general direction of speaker A. However, if it is de- 
sirable to design a speaker tracking system that favors 
the driver, then a larger weighting factor can be given to 
the beams that are directed towards the general
direction of the driver.
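
The weighting described above amounts to multiplying each beam's power level by a per-beam factor before the comparison. In the sketch below, the two-beam layout and the function name are hypothetical; the factors 0.8 and 1 are the example values given in the text.

```python
def adjust_power_levels(levels, weights):
    """Scale each beam's estimated power level by its weighting factor."""
    if len(levels) != len(weights):
        raise ValueError("one weighting factor is needed per beam")
    return [p * w for p, w in zip(levels, weights)]

# Beam 0 points toward the nearer speaker A (factor 0.8) and beam 1 toward
# the more distant speaker B (factor 1), so equal raw levels favor B.
print(adjust_power_levels([10.0, 10.0], [0.8, 1.0]))
```
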

[0040] The adjusted power levels 79 are conveyed to 
a comparison and decision device 80. 
[0041] It is also possible that the normalization device 
76 and the power level adjuster 78 are omitted so that 
the estimated power levels from the beam-power esti- 
mation device 74 are directly conveyed to the compari- 
son and decision device 80. Based on the power levels 
received, the comparison and decision device 80 selects
the highest power.

[0042] When the far-end speaker in a remote location 
talks to the near-end speakers (A,B in Figure 1), the 
voice of the far-end speaker is reproduced on a loud- 
speaker of a hands-free telecommunication device (Fig- 
ure 2). The voice of the far-end speaker on the loud- 
speaker may confuse the comparison and decision de- 
vice 80 as to which should be used for the selection of 
the DOA. Thus, it is preferable to select the next DOA 
when the far-end speaker is not talking. For that pur- 
pose, it is preferable to use a babble noise detector 84 
to generate a signal 85 indicating the speech inactivity 
period of the far-end human speaker, based on the Rx- 
Signal 36 and the Rx-VAD Signal 35. Furthermore, it is 
preferable to select the next DOA during a period of 
near-end speech activity (Tx-VAD=1). When Tx-VAD=1,
someone in the car is talking.

[0043] Thus, only during near-end speech activity and far-end
speech inactivity, the power level of the currently selected
direction is compared to those of the other directions. If one of
the other directions has a clearly higher level (for example, a
difference of more than 2dB), that direction is chosen as the new
DOA. The major advantage of the VADs is that the speaker tracking
system 10 only reacts to true near-end speech and not some
noise-related impulses, or the voice of the far-end speaker from
the loudspeaker of the hands-free phone system.
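
The decision rule described above can be sketched as follows. The function and argument names are hypothetical, the levels are assumed to be positive linear power values (so a ratio is converted with 10·log10), and the 2 dB margin is the example value from the text.

```python
import math

def select_doa(current, levels, tx_vad, far_end_inactive, margin_db=2.0):
    """Return the index of the beam to use as the new DOA.
    The direction may change only during near-end speech activity and
    far-end speech inactivity, and only if another beam is clearly
    stronger than the currently selected one."""
    if not (tx_vad and far_end_inactive):
        return current
    best = max(range(len(levels)), key=lambda k: levels[k])
    if best == current:
        return current
    gain_db = 10.0 * math.log10(levels[best] / levels[current])
    return best if gain_db > margin_db else current

# Beam 1 is about 3 dB above the current beam 0, so the DOA switches.
print(select_doa(0, [1.0, 2.0], tx_vad=1, far_end_inactive=True))
```

The margin acts as hysteresis: small level differences between neighboring beams do not cause the selected direction to flip back and forth.
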

[0044] Thus, what has been described is a speaker tracking system
which can be used in a hands-free car phone. However, the same
speaker tracking system can be used in a different environment
such as a video or teleconferencing situation. Furthermore, in the
description taken in conjunction with Figures 1 to 4, the
electro-acoustic signals from the acoustic sensors 20 are conveyed
to the beamformer 40 in a digital form. Accordingly, a
filter-and-sum beamforming technique is used for beamforming.
However, the electro-acoustic signals from the acoustic sensors 20
can also be conveyed to the beamformer 40 in an analog form.
Accordingly, a delay-and-sum technique can be used for
beamforming. Also, the acoustic sensors 20 can be arranged in a 2D
array or a 3D array in different arrangements. It is understood
that the beamforming process using a single array can only steer a
beam in one direction, while the beamforming process using a 2D
array can steer the beam in two directions. With a 3D acoustic
sensor array, the distance of the speaker can also be determined,
in addition to the directions. In spite of the complexity of a
system that uses a 2D or 3D acoustic sensor array, the principle
of speaker tracking remains the same.
[0045] Therefore, although the invention has been de- 
scribed with respect to a preferred embodiment thereof, 
it will be understood by those skilled in the art that the 
foregoing and various other changes, omissions and de- 
viations in the form and detail thereof may be made with- 
out departing from the spirit and scope of this invention. 



Claims 

1. A system having a plurality of acoustic sensors for 
tracking at least one human speaker in order to ef- 
fectively detect a voice from the human speaker, 
wherein the human speaker and the acoustic sen- 
sors are separated by a speaker distance along a 
speaker direction and wherein the human speaker 
is allowed to move relative to the acoustic sensors 
resulting in a change in the speaker direction within 
an angular range, and wherein each acoustic sen- 
sor produces an electrical signal responsive to the 
voice of the human speaker, said system compris- 
ing: 

a) a beamformer operatively connected to the
acoustic sensors to receive the electrical signal,
wherein the beamformer is capable of forming N
different beams, wherein N is a positive integer
greater than 1 and each beam defining a favorable
direction to detect the voice from the human
speaker by the acoustic sensors and each different
beam is directed in a substantially different
direction within the angular range, said beamformer
further outputting an output signal representative
of a beam power for each beam when the acoustic
sensors detect the voice; and

b) a comparator operatively connected to said
beamformer for comparing the beam power of each
beam in order to determine a most favorable
direction to detect the voice of the human speaker,
wherein said comparator compares the beam power of
each beam periodically so as to determine the most
favorable direction to detect the voice of the
human speaker according to the change in the
speaker direction.

2. The system of claim 1, further comprising a voice
activity detection device to define a near-end speech
activity period when the voice from the human speaker is
detected during a detection period, wherein the
comparator determines the most favorable detection
direction only in the near-end speech activity period.

3. The system of claim 2, wherein the voice activity
detection device further defines a near-end speech
inactivity period when the voice of the human speaker
using the system is not detected in the detection period,
said system further comprising a power estimation device
and a noise level estimation device operatively connected
to the beamformer to receive the beam power of each beam,
wherein the noise level estimation device estimates an
ambient noise in the near-end speech inactivity period
and wherein the power estimation device estimates the
beam power of each beam based on the ambient noise and
provides the estimated beam power to the comparator in
order for the comparator to determine the most favorable
detection direction.

4. The system of claim 3, further comprising a band-pass
filter to filter the output signal from the beamformer
prior to the beam power of each beam from the beamformer
being conveyed to the power estimation device and the
noise level estimation device.

5. The system of claim 1, further comprising a transceiver
to transmit the voice of the human speaker utilizing the
system to communicate with a far-end human speaker in a
communication period and to receive a far-end voice
signal responsive to a voice from the far-end human
speaker during the communication period.

6. The system of claim 5, further comprising a voice
activity detection device to indicate a far-end speech
inactivity period, wherein the comparator determines the
most favorable detection direction only in the far-end
speech inactivity period.
7. The system of claim 6, further comprising a babble
noise detector for receiving the far-end voice signal 
and a signal from the voice activity detection device 
in order to generate a far-end speech inactivity sig- 
nal conveyed to the comparator for indicating the 
far-end speech inactivity period. 

8. The system of claim 1, wherein the comparator 
compares the beam power of each beam within a 
predetermined time period in order to determine the
most favorable detection direction and wherein the 
most favorable detection direction determined within
the predetermined time period is kept unchanged
after the predetermined time period has expired. 

9. A method of tracking at least one human speaker 
using a plurality of acoustic sensors in order to ef- 
fectively detect a voice from the human speaker, 
wherein the human speaker and the acoustic sen- 
sors are separated by a speaker distance along a 
speaker direction and wherein the human speaker 
is allowed to move relative to the acoustic sensors 
resulting in a change in the speaker direction within 
an angular range, and wherein each acoustic sen- 
sor produces an electrical signal responsive to the 
voice of the human speaker, said method compris- 
ing the steps of: 

a) forming N different beams from the electrical 
signal, wherein N is a positive integer greater 
than 1 and each beam defining a favorable di- 
rection to detect the voice of the human speak- 
er by the acoustic sensors and each different 
beam is directed in a substantially different di- 
rection within the angular range, wherein each 
beam has a beam power responsive to the 
electrical signal; and 

b) periodically comparing the beam power of 
each beam in order to determine a most favo- 
rable direction to detect the voice of the human 
speaker according to the change of the speaker 
direction. 
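
By way of a non-limiting illustration, steps a) and b) may be sketched as follows. The sketch assumes a uniform linear array, far-field plane waves, integer-sample delay-and-sum beamforming and mean-square beam power; none of these choices is recited in the claim, and all names, the sensor spacing and the sampling rate are hypothetical.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air

def steering_delays(num_sensors, spacing_m, angle_deg, fs):
    """Integer sample delays that align a plane wave arriving from angle_deg."""
    tau = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return [round(m * tau * fs) for m in range(num_sensors)]

def delay_and_sum(channels, delays):
    """Average the sensor signals after advancing each by its steering delay."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d
            if 0 <= idx < n:          # samples outside the record count as zero
                acc += ch[idx]
        out.append(acc / len(channels))
    return out

def beam_power(signal):
    """Mean-square power of one beam output."""
    return sum(x * x for x in signal) / len(signal)

def most_favorable_beam(channels, beam_angles_deg, spacing_m, fs):
    """Step a): form one beam per angle. Step b): pick the strongest beam."""
    powers = []
    for angle in beam_angles_deg:
        delays = steering_delays(len(channels), spacing_m, angle, fs)
        powers.append(beam_power(delay_and_sum(channels, delays)))
    best = max(range(len(powers)), key=powers.__getitem__)
    return best, powers

# Synthetic check: 4 sensors 0.1 m apart, white-noise "voice" from 30 degrees.
random.seed(0)
fs = 16000
voice = [random.uniform(-1.0, 1.0) for _ in range(2000)]
true_delays = steering_delays(4, 0.1, 30.0, fs)
channels = [
    [voice[t - d] if 0 <= t - d < len(voice) else 0.0 for t in range(len(voice))]
    for d in true_delays
]
angles = [-60.0, -30.0, 0.0, 30.0, 60.0]
best, powers = most_favorable_beam(channels, angles, 0.1, fs)
print("selected beam angle:", angles[best])
```

In a running system the comparison would be repeated on each new block of samples, so that the selected beam follows the change of the speaker direction.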

10. The method of claim 9, further comprising the step 
of determining a near-end speech activity period 
when the voice of the human speaker is detected in 
a detection period so that the beam power of each 
beam in the detection period is compared to deter- 
mine the most favorable detection direction only in 
the near-end speech activity period. 
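
By way of a non-limiting illustration, the gating of the comparison by near-end speech activity may be sketched as follows. The claim does not say how speech activity is determined; the energy threshold used here is only a crude stand-in for a voice activity detector, and all names and the 6 dB margin are hypothetical.

```python
def frame_energy(frame):
    """Mean-square energy of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)

def is_near_end_speech(frame, noise_floor, margin_db=6.0):
    """Assumed detector: speech is present when energy exceeds the floor by margin_db."""
    threshold = noise_floor * 10.0 ** (margin_db / 10.0)
    return frame_energy(frame) > threshold

def track_direction(frames_per_beam, reference_frames, noise_floor, initial_beam=0):
    """frames_per_beam[k][b] is the output of beam b in frame k.

    reference_frames[k] is the signal fed to the activity detector in frame k.
    The beam index is updated only in frames flagged as speech.
    """
    current = initial_beam
    history = []
    for beams, ref in zip(frames_per_beam, reference_frames):
        if is_near_end_speech(ref, noise_floor):
            powers = [frame_energy(b) for b in beams]
            current = max(range(len(powers)), key=powers.__getitem__)
        history.append(current)   # silent frames keep the previous direction
    return history
```

With this gating, a silent frame in which noise happens to be stronger in another beam does not move the selected direction.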

11. The method of claim 10, further comprising the 
steps of: 

determining a near-end speech inactivity peri- 
od when the voice of the human speaker is not 
detected in the detection period; 
estimating an ambient noise during the near-end
speech inactivity period; and
estimating the beam power of each beam
based on the ambient noise so as to use the
estimated beam power of each beam to determine
the most favorable detection direction.
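
By way of a non-limiting illustration, one possible reading of these steps is sketched below. It assumes that the ambient noise is tracked per beam by exponential averaging during inactivity and that the estimated beam power is the measured power minus that noise estimate; the claim does not specify either choice, and the class name and smoothing constant are hypothetical.

```python
class BeamNoiseTracker:
    """Tracks a per-beam noise power and removes it from measured beam powers."""

    def __init__(self, num_beams, smoothing=0.9):
        self.noise = [0.0] * num_beams
        self.smoothing = smoothing
        self.initialized = False

    def update_noise(self, beam_powers):
        """Call only in a near-end speech inactivity period."""
        if not self.initialized:
            # First noise-only frame seeds the estimate directly.
            self.noise = list(beam_powers)
            self.initialized = True
            return
        a = self.smoothing
        self.noise = [a * n + (1.0 - a) * p for n, p in zip(self.noise, beam_powers)]

    def estimated_speech_power(self, beam_powers):
        """Measured power minus the noise estimate, floored at zero."""
        return [max(p - n, 0.0) for p, n in zip(beam_powers, self.noise)]
```

For example, if a beam facing a noise source reads 0.5 in silence and 0.6 during speech, while another beam reads 0.1 and 0.4, the raw powers favor the first beam but the noise-compensated estimates favor the second.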

12. The method of claim 9, wherein the human speaker 
communicates with a far-end human speaker at a 
remote location in a communication period, said
method further comprising the step of receiving a
voice of the far-end human speaker and indicating
a far-end speech inactivity period so that the beam
power of each beam is compared to determine the
most favorable detection direction only in the far-end
speech inactivity period.

13. The method of claim 9, wherein each beam has a 
beamformer output related to the power of said 
beam, the method further comprising the step of
frequency filtering the beamformer output of each
beam prior to determining the most favorable
detection direction in step 2.

14. The method of claim 13, wherein the beamformer
output is filtered by a band-pass filter having a
frequency range between 1 kHz and 2 kHz.
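
By way of a non-limiting illustration, the band-pass filtering of a beamformer output before the power comparison may be sketched as follows. The filter order and design are not recited in the claim; the sketch assumes a single second-order (biquad) section with coefficients from the widely used audio EQ cookbook formulas, and the function names and the 8 kHz sampling rate in the usage note are hypothetical.

```python
import math

def bandpass_coefficients(f_low, f_high, fs):
    """Second-order band-pass with unity gain at the centre frequency."""
    f0 = math.sqrt(f_low * f_high)      # geometric centre of the pass band
    q = f0 / (f_high - f_low)           # quality factor from the bandwidth
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [alpha / a0, 0.0, -alpha / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def biquad_filter(signal, b, a):
    """Direct-form I difference equation for one biquad section."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def band_limited_power(beam_output, fs, f_low=1000.0, f_high=2000.0):
    """Mean-square power of the beam output after band-pass filtering."""
    b, a = bandpass_coefficients(f_low, f_high, fs)
    filtered = biquad_filter(beam_output, b, a)
    return sum(y * y for y in filtered) / len(filtered)
```

Calling `band_limited_power` on each beam output at, say, `fs = 8000.0` gives powers in which components well below 1 kHz contribute far less than components inside the band.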

15. The method of claim 13, wherein the most favorable
detection direction is updated with a frequency
ranging from 8 Hz to 50 Hz in order to obtain a new
most favorable detection direction for replacing a
current most favorable detection direction, said
method further comprising the step of adding a part
of the beamformer output of the current most favorable
detection direction to a part of the beamformer
output of the new most favorable detection direction
in order to smooth out the beamformer output difference
between the new and the current most favorable
detection direction.
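
By way of a non-limiting illustration, the smoothing step may be sketched as a cross-fade over one frame. The claim only requires that a part of each output be added; the linear ramp and the choice of weights that sum to one are assumptions, and the function name is hypothetical.

```python
def crossfade(current_output, new_output):
    """Blend from the current beam output to the new one over a single frame."""
    n = len(current_output)
    if n != len(new_output):
        raise ValueError("both beam outputs must cover the same frame")
    if n == 1:
        return list(new_output)
    mixed = []
    for i in range(n):
        w_new = i / (n - 1)   # ramps from 0 at the first sample to 1 at the last
        mixed.append((1.0 - w_new) * current_output[i] + w_new * new_output[i])
    return mixed
```

The frame starts entirely on the current beam and ends entirely on the new beam, so the output contains no step at the moment of switching.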

16. The method of claim 9, wherein the beam power 
comparing step is carried out within a predeter- 
mined time period and wherein the most favorable 
direction so determined is kept unchanged after the
predetermined time period has expired.

17. The method of claim 9, wherein the acoustic sen- 
sors are arranged in a single array. 

18. The method of claim 9, wherein the acoustic sensors
are arranged in a 2D array.

19. The method of claim 9, wherein the acoustic sen- 
sors are arranged in a 3D array. 

20. The method of claim 9, wherein the most favorable 
detection direction is updated with a frequency 
ranging from 8 Hz to 50 Hz in order to obtain a new
most favorable detection direction for replacing a
current most favorable detection direction. 

21. The method of claim 20, wherein the current most 
favorable detection direction is replaced by the new
most favorable detection direction only when the
beam power difference between the new and the
current most favorable detection direction exceeds
2 dB.
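
By way of a non-limiting illustration, the 2 dB replacement rule may be sketched as follows. The sketch assumes linear beam powers converted to decibels with 10·log10; the function names and the small floor used to avoid a logarithm of zero are hypothetical.

```python
import math

def power_db(power):
    """Convert a linear power to decibels, guarding against log of zero."""
    return 10.0 * math.log10(max(power, 1e-12))

def update_direction(current_beam, beam_powers, threshold_db=2.0):
    """Switch beams only if the strongest beam exceeds the current one by threshold_db."""
    candidate = max(range(len(beam_powers)), key=beam_powers.__getitem__)
    if candidate == current_beam:
        return current_beam
    gain_db = power_db(beam_powers[candidate]) - power_db(beam_powers[current_beam])
    return candidate if gain_db > threshold_db else current_beam
```

A candidate beam with 1.5 times the power of the current beam (about 1.8 dB) does not trigger a switch, whereas one with twice the power (about 3 dB) does; the function would be called once per update, that is 8 to 50 times per second.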

22. The method of claim 9, further comprising the step
of assigning a weighting factor to each beam ac- 
cording to the speaker direction in order to adjust 
the beam power of each beam prior to determining 
the most favorable detection direction in step 2. 

23. The method of claim 9, further comprising the step 
of assigning a weighting factor to each beam ac- 
cording to the speaker location in order to adjust the 
beam power of each beam prior to determining the
most favorable detection direction in step 2. 
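
By way of a non-limiting illustration, the weighting of claims 22 and 23 may be sketched as a per-beam multiplication applied before the comparison. How the weighting factors are derived from the speaker direction or location is not recited; the multiplicative form, the function name and the example values are assumptions.

```python
def weighted_best_beam(beam_powers, weights):
    """Scale each beam power by its weighting factor, then pick the strongest."""
    if len(beam_powers) != len(weights):
        raise ValueError("one weighting factor is needed per beam")
    adjusted = [p * w for p, w in zip(beam_powers, weights)]
    best = max(range(len(adjusted)), key=adjusted.__getitem__)
    return best, adjusted
```

For example, with beam powers `[1.0, 0.9, 0.5]` and weights `[0.5, 1.0, 1.0]`, the first beam is de-emphasized and the second beam is selected even though its raw power is lower.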

24. A system having a plurality of acoustic sensors (20) 
for detecting and tracking the voice of a human 
speaker that can move relative to the sensors so as
to be separated by a speaker distance along a 
speaker direction and within an angular range, and 
wherein each acoustic sensor produces an electri- 
cal signal responsive to the voice of the human 
speaker, characterised by

a beamformer (40) operatively connected to the
acoustic sensors to receive the electrical signal,
wherein the beamformer is capable of
forming N different beams, wherein N is a positive
integer greater than 1 and each beam defining
a favorable direction to detect the voice
from the human speaker by the acoustic sensors
and each different beam is directed in a
substantially different direction within the angular
range, said beamformer further outputting
an output signal (75) representative of a beam 
power for each beam when the acoustic sen- 
sors detect the voice; and 

a comparator (80) operatively connected to
said beamformer for comparing the beam pow- 
er of each beam in order to determine a most 
favorable direction to detect the voice of the hu- 
man speaker. 